Accelerating Content-Defined Chunking for Data Deduplication Based on Speculative Jump

Authors

Abstract

In data deduplication systems, chunking has a significant impact on both the deduplication ratio and the throughput. Existing Content-Defined Chunking (CDC) approaches exploit a sliding window to calculate rolling hashes of the input stream byte by byte, and then declare a chunk cut-point whenever the hash satisfies a given cut-condition. Since previous CDC approaches are extremely costly, they often significantly degrade the throughput of deduplication systems. In this paper, we argue that calculating and checking hashes byte by byte is unnecessary. To reduce the CPU overhead of CDC, we propose a jump-based chunking (JC) approach. The key idea is to introduce a jump-condition: when the hash satisfies the jump-condition, the chunker can jump over a specific length of the input without examining it. Moreover, we also explore the relationship among the cut-condition, the jump-condition, and the jump size. Our theoretical studies demonstrate the effectiveness and efficiency of JC without compromising the deduplication ratio. Experimental results show that JC improves chunking throughput by about 2× on average compared with state-of-the-art approaches, while still guaranteeing a high deduplication ratio.
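To make the jump idea concrete, here is a minimal Python sketch of the general jump/leap pattern the abstract describes; it is not the paper's actual JC algorithm. It assumes a toy cut-condition that requires the last W bytes before a cut-point to satisfy a per-byte predicate, so any byte that fails the predicate serves as a jump-condition: every candidate cut-point whose window would contain that byte can be skipped at once. All names, constants, and conditions below are illustrative assumptions.

    # Illustrative jump-style chunking sketch (assumed parameters, not the paper's JC).
    import random

    random.seed(1)
    BYTE_HASH = [random.getrandbits(64) for _ in range(256)]  # random value per byte

    W = 13                            # cut-condition: W consecutive "good" bytes
    MIN_SIZE, MAX_SIZE = 2048, 65536  # chunk size bounds

    def good(b):
        # Per-byte predicate; holds with probability about 1/2.
        return (BYTE_HASH[b] & 1) == 1

    def next_cut(data, start):
        # Return the offset of the next cut-point after `start`.
        end = min(start + MAX_SIZE, len(data))
        if end - start <= MIN_SIZE:
            return end
        i = start + MIN_SIZE          # candidate end position of the chunk
        while i < end:
            for j in range(W):        # scan the window backwards from the newest byte
                if not good(data[i - j]):
                    # Jump-condition met: byte (i - j) rules out every candidate up
                    # to (i - j) + W - 1, so jump straight past all of them.
                    i = (i - j) + W
                    break
            else:
                return i + 1          # all W bytes are good: cut-condition satisfied
        return end                    # forced cut at MAX_SIZE or at end of data

    def chunks(data):
        start = 0
        while start < len(data):
            cut = next_cut(data, start)
            yield data[start:cut]
            start = cut

On random input the inner scan fails after roughly two byte tests, and each failure skips up to W positions instead of one, which is where the speedup over byte-by-byte scanning comes from; how the cut-condition, the jump-condition, and the jump size interact is what the paper's theoretical analysis works out.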


Similar Resources

FastCDC: a Fast and Efficient Content-Defined Chunking Approach for Data Deduplication

Content-Defined Chunking (CDC) has been playing a key role in data deduplication systems in the past 15 years or so due to its high redundancy detection ability. However, existing CDC-based approaches introduce heavy CPU overhead because they declare the chunk cutpoints by computing and judging the rolling hashes of the data stream byte by byte. In this paper, we propose FastCDC, a Fast and eff...
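For contrast with the jump-based sketch above, the following is a minimal byte-by-byte loop in the Gear-hash style used by chunkers such as FastCDC; the constants and names are illustrative assumptions, not FastCDC's actual parameters. A rolling fingerprint is updated for every byte past the minimum chunk size and tested against a mask until the cut-condition fires.

    # Minimal byte-by-byte Gear-style CDC loop (illustrative constants only).
    import random

    random.seed(2)
    GEAR = [random.getrandbits(64) for _ in range(256)]  # random 64-bit value per byte

    MASK = (1 << 13) - 1              # cut-condition fires about once per 8 KiB
    MIN_SIZE, MAX_SIZE = 2048, 65536  # chunk size bounds
    U64 = (1 << 64) - 1               # keep the fingerprint in 64 bits

    def cut_points(data):
        # Yield chunk boundaries, hashing and testing the stream byte by byte.
        start = 0
        while start < len(data):
            end = min(start + MAX_SIZE, len(data))
            cut = end
            fp = 0
            for i in range(start + MIN_SIZE, end):
                fp = ((fp << 1) + GEAR[data[i]]) & U64  # rolling Gear update
                if (fp & MASK) == 0:                    # test the cut-condition
                    cut = i + 1
                    break
            yield cut
            start = cut

Every byte past the minimum size costs a table lookup, a shift-and-add, and a mask test, which is the per-byte CPU overhead this passage refers to.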


Bimodal Content Defined Chunking for Backup Streams

Data deduplication has become a popular technology for reducing the amount of storage space necessary for backup and archival data. Content defined chunking (CDC) techniques are well established methods of separating a data stream into variable-size chunks such that duplicate content has a good chance of being discovered irrespective of its position in the data stream. Requirements for CDC incl...


Leap-based Content Defined Chunking - Theory and Implementation

Content Defined Chunking (CDC) is an important component in data deduplication, which affects both the deduplication ratio as well as deduplication performance. The sliding-window-based CDC algorithm and its variants have been the most popular CDC algorithms for the last 15 years. However, their performance is limited in certain application scenarios since they have to slide byte by byte. The a...


Two Stage Max Gain Content Defined Chunking for De-duplication

Data de-duplication is a simple concept backed by smart technology. Data blocks are stored only once: de-duplication systems decrease storage consumption by identifying distinct chunks of data with identical content. They then store a single copy of each chunk along with metadata about how to reconstruct the original files from the chunks, which takes up less stora...


A Logistic Based Mathematical Model to Optimize Duplicate Elimination Ratio in Content Defined Chunking Based Big Data Storage System

Longxiang Wang 1, Xiaoshe Dong 1, Xingjun Zhang 1,*, Fuliang Guo 1, Yinfeng Wang 2 and Weifeng Gong 3. 1 The School of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an 710049, China; 2 The Shenzhen Institute of Information Technology, Shenzhen 518172, China; ...



Journal

Journal title: IEEE Transactions on Parallel and Distributed Systems

Year: 2023

ISSN: 1045-9219, 1558-2183, 2161-9883

DOI: https://doi.org/10.1109/tpds.2023.3290770